-
Notifications
You must be signed in to change notification settings - Fork 455
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add jaeger/opentracing tracing to m3query #1321
Conversation
d17e4a7
to
4ca756b
Compare
Codecov Report
@@ Coverage Diff @@
## andrewmains12/cost_accounting #1321 +/- ##
===============================================================
Coverage ? 64.3%
===============================================================
Files ? 771
Lines ? 64218
Branches ? 0
===============================================================
Hits ? 41333
Misses ? 20009
Partials ? 2876
Continue to review full report at Codecov.
|
src/query/DATAMODEL.md
Outdated
@@ -0,0 +1,5 @@ | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops this was pretty much just my own notes--removing.
src/query/executor/request.go
Outdated
func (r *Request) execute(ctx context.Context, pp plan.PhysicalPlan) (*ExecutionState, error) { | ||
sp := startSpan(r.engine.metrics.executingHist, r.engine.metrics.executing) | ||
func (r *Request) generateExecutionState(ctx context.Context, pp plan.PhysicalPlan) (*ExecutionState, error) { | ||
sp, ctx := opentracing.StartSpanFromContext(ctx, "generateExecutionState") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about the naming semantics for opentracing, but do they recommend camel case or using _
or something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yuri says _
--I'll change it. Good catch!
4ca756b
to
aef38cb
Compare
aef38cb
to
1b116c3
Compare
src/query/executor/transform/exec.go
Outdated
ProcessBlock(queryCtx *models.QueryContext, ID parser.NodeID, b block.Block) (block.Block, error) | ||
} | ||
|
||
// ProcessSimpleBlock is a utility for OpNode instances which simply propagate their data after doing their own |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice comment!
|
||
func TestProcessSimpleBlock(t *testing.T) { | ||
t.Run("closes next block", func(t *testing.T) { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove extra line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I should have filled these tests in before pushing. Doing that now.
src/query/functions/fetch.go
Outdated
// Execute runs the fetch node operation | ||
func (n *FetchNode) Execute(queryCtx *models.QueryContext) error { | ||
ctx := queryCtx.Ctx | ||
blockResult, err := n.fetch(queryCtx.Ctx, queryCtx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use ctx
instead of queryCtx.Ctx
?
07b3f36
to
cbb2e6e
Compare
Error: respErr.Err.Error(), | ||
}, | ||
RqID: logging.ReadContextID(ctx), | ||
}, logging.WithContext(ctx)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to log id in the error returned to the client? The failing request ID is still logged since the logger has the context id on it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. So the idea here is to help aid in debugging; by returning the request id to the client, we make it easy to tie an error to logs + traces. Basically, we can just ask the user for the RqID from grafana's query inspector, and go directly from that to traces + logs.
Does that seem reasonable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, doesn't seem like it would cause issues. We've only got tracing on read right? Might want to add the same thing to read_instantaneous
, too... Would it make sense to take this block and put it into xhttp
as a method like WriteJSONResponseWithRequestID
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah good call, I shouldn't have been lazy about it here :). I'll factor it out.
@@ -88,6 +88,10 @@ type countValuesNode struct { | |||
controller *transform.Controller | |||
} | |||
|
|||
func (n *countValuesNode) Params() parser.Params { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this function (and other snowflakes existing outside of various base.go
files) be refactored to the ProcessBlock(queryCtx *models.QueryContext, ID parser.NodeID, b block.Block)
model?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah good call, missed these. This is one reason it's too bad we can't put tracing etc into Controller
(at least, not without having weird traces); it makes the fanout of this change (and potentially other similar ones) much wider. That said I don't have a good alternative model immediately handy so I'll go through and do the switch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it's a bit annoying unfortunately; I guess it's a downside that we have to deal with in return for the nice properties of functions calling their own downstream processing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah for a brief moment I was like "we should just explicitly walk the graph instead" and then I realized that would be a large rewrite 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, was looking at making a few nicety fixes to DAG generation/execution/mapping, and it's a pretty hefty beast unfortunately to make it more usable for things like this (although it would definitely not go astray!)
src/query/executor/transform/exec.go
Outdated
ProcessBlock(queryCtx *models.QueryContext, ID parser.NodeID, b block.Block) (block.Block, error) | ||
} | ||
|
||
// ProcessSimpleBlock is a utility for OpNode instances which simply propagate their data after doing their own |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this describe every block type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha yeah true. The "exception" is nodes which need to wait for multiple blocks before actually executing, e.g. binary nodes. I'll update the comment to clarify that.
"github.com/m3db/m3/src/query/util/opentracing" | ||
) | ||
|
||
type simpleOpNode interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit weird having an interface that a bunch of nodes satisfy be private
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is definitely a golang quirk, but I'd argue that having it be private makes reasonable sense. I don't necessarily want other packages using the interface directly; its only reason for existence is ProcessSimpleBlock
. I would even make it anonymous if that were a thing go supported.
We can also always make it public later :)
require.NoError(t, doCall(tctx)) | ||
}) | ||
|
||
t.Run("errors on process error", func(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: same comment about the t.Run
, seems to swallow up useful information for failing tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I think about it, mostly had problems with it in tabled tests, when iterating over them... when one of twenty fails it's sometimes a pain to find exactly which test failed exactly where.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, totally agreed on that! I always try to use searchable names in those cases, which helps, but it can still be annoying.
cbb2e6e
to
3d1ac04
Compare
src/query/executor/transform/lazy.go
Outdated
@@ -32,6 +33,12 @@ type sinkNode struct { | |||
block block.Block | |||
} | |||
|
|||
// Params for sinkNode returns an params object for lazy sinks. It doesn't appear in queries (since sink nodes are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think this one may have fallen through the cracks (thanks github for collapsing comments for some reason 🙄 )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah good catch--lint was complaining at me earlier about not having a comment on this; turns out that's because I don't actually need it to implement the interface. Removed.
return transform.ProcessSimpleBlock(n, n.controller, queryCtx, ID, b) | ||
} | ||
|
||
func (n *takeNode) ProcessBlock(queryCtx *models.QueryContext, ID parser.NodeID, b block.Block) (block.Block, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -142,6 +147,14 @@ func (n *baseNode) Process(queryCtx *models.QueryContext, ID parser.NodeID, b bl | |||
return n.controller.Process(queryCtx, nextBlock) | |||
} | |||
|
|||
func (n *baseNode) processWithTracing(queryCtx *models.QueryContext, lhs block.Block, rhs block.Block) (block.Block, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct me if I'm wrong, but this will show up on the span as two function calls, one of which will be pretty much instantaneous, right? Should be fine, and probably not worth overengineering here to avoid that case (like starting the span only when both LHS and RHS satisfied), but just trying to get an idea of how it will look :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That being said, it might be necessary to add a parent span to NewScalarOp
, since there is a potential (highly unlikely, but still) scenario where you have a binary function between a series and a scalar, e.g. up + 1
, and somehow the up
block resolves first, which may cut off your trace early if I'm understanding the flow correctly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
blocks, err := c.processCompletedBlocks(queryCtx, processRequests, maxBlocks) | ||
defer closeBlocks(blocks) | ||
|
||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So probably(definitely!) out of scope for this PR, but would be good to revisit temporal functions at some point and add in some logic that if any incoming block errored, it should drop any incoming blocks with the same error instead of trying to add them and starting processing before hitting it... May add an issue regarding this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! What do you mean by "any incoming block errored"? As in, there was an error processing it or an error upstream? I'm definitely not opposed to exiting earlier when possible though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The general function of temporal functions is that blocks come in, and are added to a cache. When enough blocks come in to fully calculate a downstream block, it is processed (which can error). At the moment, if the processing fails, that particular request will fail but there will still be in-flight blocks coming into the temporal function, which do not necessarily need to be computed since the entire query will error anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok cool, thanks for the clarification. That seems like a kind of generic problem--other parts of the query should be cancelled once one part fails. Do we have any general cancellation mechanism already? I suppose we could check the context periodically during each node's Process
; that would probably be the idiomatic way to do it. No idea how often "periodically" should be though haha.
3d1ac04
to
207d695
Compare
35d47f1
to
9d59070
Compare
207d695
to
004b1e5
Compare
9d59070
to
add5124
Compare
004b1e5
to
cb0d32f
Compare
cb0d32f
to
91185cf
Compare
7a885dc
to
fc0609e
Compare
a2b2d93
to
1275da2
Compare
fc0609e
to
0d0b885
Compare
Adds a `queryContext` argument to `OpNode.Process` to hold any per query state. I use this in both my [cost accounting](#1207) and [tracing](#1321) PR's. At the time, I based my tracing branch off of the cost accounting branch. Tracing is closer to landing though, so I've now factored out the common changes, and rebased them both against this branch.
0d0b885
to
0cf646b
Compare
0cf646b
to
cdd4a00
Compare
@@ -167,6 +170,9 @@ func (h *PromReadHandler) ServeHTTPWithEngine( | |||
|
|||
result, err := read(ctx, engine, h.tagOpts, w, params) | |||
if err != nil { | |||
sp := opentracingutil.SpanFromContextOrRoot(ctx) | |||
sp.LogFields(opentracinglog.Error(err)) | |||
opentracingext.Error.Set(sp, true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should you have sp.Finish()
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, no; that function is intended to extract the span from context if it exists, and return a noop/dummy if not. In the first case, the calling function is already in charge of calling Finish()
; in the second, it shouldn't be needed. Does that make sense?
I updated the function to be opentracingutil.SpanFromContextOrNoop
, making it return a noop span, which should ensure that it doesn't need to be closed even if called with a context without a span.
docs/operational_guide/monitoring.md
Outdated
insight into query performance and errors. | ||
|
||
### Configuration | ||
Currently, only https://www.jaegertracing.io/ is supported as a backend. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe only [Jaeger](https://www.jaegertracing.io/) is supported as a tracing backend.
return sp | ||
} | ||
|
||
return opentracing.StartSpan("SpanFromContextOrRoot - dummy") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does this string appear re: trace? Could be good to make it a constant if nothing else
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah good question. So in properly configured production usage, this branch will never be hit; the general expectation is that functions calling this aren't at the root level, and will already have a span passed in. Having the dummy span case instead of panicking is basically a convenience for tests and callers who don't really care about tracing. Does that make sense?
One other thing I could do is have this function take in the root name as an argument instead--that might help folks track down any misconfigured call paths if they're hooking up tracing.
src/query/util/httperrors/errors.go
Outdated
RqID string `json:"rqID"` | ||
} | ||
|
||
// ErrorWithReqID writes an xhttp.ErrorResponse with an added request id (RqId) field read from the request |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit1: could be better to call it ErrorWithReq
rather than specifying it must be a reqID?
nit2: Split the NB out
// context.
//
// NB: ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
r.e. 1--sure, changed it to ErrorWithReqInfo
(more generic + still accurate)
nit2: Split the NB out
Sure, done.
docs/operational_guide/monitoring.md
Outdated
|
||
## Logs | ||
|
||
TODO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably better to remove these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eh I don't see much harm in having stubs. I switched it to "TODO: document how to retrieve metrics for M3DB components." so that people don't think the metrics themselves are TODO.
src/query/api/v1/httpd/handler.go
Outdated
@@ -139,6 +136,23 @@ func NewHandler( | |||
return h, nil | |||
} | |||
|
|||
func applyMiddleware(base http.Handler) http.Handler { | |||
rtn := http.Handler(&cors.Handler{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better var name?
0ae9247
to
3989018
Compare
This diff adds opentracing support to m3query. Spans are started for every HTTP request (courtesy middleware from
github.com/opentracing-contrib/go-stdlib
). I then have child spans started for compilation + execution, with drilldowns into each node of the query. Here's a sample trace:Query is
sum(increase(coordinator_engine_datapoints{type="fetched"}[10s]))
One implementation note: node instrumentation is a bit less clean than I would have liked; since all nodes call their children before exiting, i.e.:
I instrumented them like:
The alternative would be to do:
but that gives unnecessary nesting imo: